Efficient Unsupervised Authorship Clustering Using Impostor Similarity
نویسندگان
چکیده
Some real-world authorship analysis applications require techniques that scale to thousands of documents with little or no a priori information about the number of candidate authors. While there is extensive research on identifying authors given a small set of candidates and ample training data, almost none is based on real-world applications of clustering documents by authorship, independent of prior knowledge of the authors. We adapt the impostor method of authorship verification to authorship clustering using agglomerative clustering and, for efficiency, locality sensitive hashing. We validate our methods on a publicly available blog corpus that has been used in previous authorship research. We show that our efficient method matches previous results for authorship verification on the blog corpus. We extend previous results to show that the impostor algorithm is robust to incorrectly selected impostors. On the authorship clustering task, we show that the impostor similarity method clearly outperforms other techniques on the blog corpus.
منابع مشابه
Automated unsupervised authorship analysis using evidence accumulation clustering
Authorship Analysis aims to extract information about the authorship of documents from features within those documents. Typically, this is performed as a classification task with the aim of identifying the author of a document, given a set of documents of known authorship. Alternatively, unsupervised methods have been developed primarily as visualisation tools to assist the manual discovery of ...
متن کاملAuthorship Verification Using the Impostors Method Notebook for PAN at CLEF 2013
This paper describes the evaluation of the GenIM method, which participated in the PAN' 13 authorship identification competition. The approach is based on comparing the similarity between the given documents and a number of external (impostor) documents, so that documents can be classified as having been written by the same author, if they are shown to be more similar to each other than to the ...
متن کاملBotOnus: an online unsupervised method for Botnet detection
Botnets are recognized as one of the most dangerous threats to the Internet infrastructure. They are used for malicious activities such as launching distributed denial of service attacks, sending spam, and leaking personal information. Existing botnet detection methods produce a number of good ideas, but they are far from complete yet, since most of them cannot detect botnets in an early stage ...
متن کاملClustering by Authorship Within and Across Documents
The vast majority of previous studies in authorship attribution assume the existence of documents (or parts of documents) labeled by authorship to be used as training instances in either closed-set or open-set attribution. However, in several applications it is not easy or even possible to find such labeled data and it is necessary to build unsupervised attribution models that are able to estim...
متن کاملPresentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures
Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...
متن کامل